Skip to content

fix: translate SIMILAR TO pattern to PostgreSQL-compatible regex#22273

Draft
Dandandan wants to merge 1 commit into
apache:mainfrom
Dandandan:fix/similar-to-postgres-compat
Draft

fix: translate SIMILAR TO pattern to PostgreSQL-compatible regex#22273
Dandandan wants to merge 1 commit into
apache:mainfrom
Dandandan:fix/similar-to-postgres-compat

Conversation

@Dandandan

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

SIMILAR TO is supposed to mix SQL LIKE wildcards (%, _) with POSIX regex metacharacters and match the entire string. DataFusion was lowering SIMILAR TO directly to a RegexMatch over the pattern verbatim, so:

  • 'abc' SIMILAR TO 'a%' returned false (PG: true) because % was passed through as a literal regex character.
  • 'abc' SIMILAR TO 'b' returned true (PG: false) because the regex match isn't anchored.
  • ., ^, $ were treated as regex metacharacters instead of literals.

What changes are included in this PR?

  • New translate_similar_to_pattern() helper in datafusion/physical-expr/src/expressions/binary.rs that converts a SIMILAR TO pattern into an equivalent POSIX regex:
    • Anchors the result with ^...$.
    • %.*, _..
    • Escapes ., ^, $ (literal in SIMILAR TO, meta in regex).
    • Passes POSIX metas (|, *, +, ?, (), {m,n}, [...]) through unchanged.
    • Handles \ as an escape for the next character.
    • Passes bracket expressions through verbatim, including a leading literal ]/^].
  • Planner (datafusion/physical-expr/src/planner.rs) now translates literal Utf8 / LargeUtf8 / Utf8View patterns at planning time. NULL patterns flow through as a typed Utf8 null. Non-literal patterns return a clear not_impl_err rather than the previous silently-wrong behavior.
  • SQL layer (datafusion/sql/src/expr/mod.rs) pattern-type check widened to accept LargeUtf8 and Utf8View literals (previously rejected even though the underlying regex match supports them).

Are these changes tested?

Yes:

  • Unit test similar_to_pattern_translation in binary.rs covers wildcards, anchoring, regex metas, literal ./^/$, backslash escapes, bracket expressions (including []abc] and [^]abc]).
  • datafusion/sqllogictest/test_files/strings.slt has a new regression block exercising the bug-report case, _ wildcard, anchoring, literal ./^/$, regex metas (|, {m,n}, +), backslash-escaped wildcards, NULL pattern, and the non-literal-pattern error.
  • The existing SIMILAR TO 'p[12].*' test in strings.slt was relying on the buggy regex-passthrough behavior (p1e1 etc. matched only because . was treated as regex .); it's been changed to 'p[12]%' which expresses the same intent under correct SIMILAR TO semantics.

Are there any user-facing changes?

Yes — SIMILAR TO is now PostgreSQL-compatible:

  • Behavior changes for patterns containing %, _, or unanchored matches. Existing patterns that relied on the previous regex-passthrough behavior may need to be updated (most obviously, change .* to %).
  • Non-literal patterns now return Not yet implemented: SIMILAR TO with a non-literal pattern is not yet supported instead of silently returning a regex match. This was almost certainly broken in practice already, but it is a visible error message change.

`SIMILAR TO` was lowered directly to a regex match without translating
SQL wildcards or anchoring the pattern, so `'abc' SIMILAR TO 'a%'`
returned false instead of true and `.`/`^`/`$` were treated as regex
metacharacters instead of literals.

The planner now translates literal `SIMILAR TO` patterns into an
equivalent POSIX regex (anchored with `^...$`, `%`→`.*`, `_`→`.`,
literal `.`/`^`/`$` escaped, backslash escape and bracket expressions
preserved) before lowering to the existing regex-match operator. NULL
patterns flow through as a typed Utf8 null. Non-literal patterns now
return a clear `not_impl` error rather than a silently wrong result,
and the SQL-layer pattern-type check is widened to accept `LargeUtf8`
and `Utf8View` literals.

Closes apache#22263.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added sql SQL Planner physical-expr Changes to the physical-expr crates sqllogictest SQL Logic Tests (.slt) labels May 16, 2026

@kosiew kosiew left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Dandandan
I think the direction makes sense, but I found two blocking issues that need to be addressed before this is ready.

// Backslash escapes the next character in SIMILAR TO. Emit it
// as a literal in the regex by re-escaping it.
match chars.next() {
Some(next) => {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is still treating escaped characters as regex escapes rather than regex literals.

The new invariant says that \X in SIMILAR TO should mean a literal X, but this code turns the pattern \d into the regex ^\d$. That matches a digit and does not match d.

I can reproduce this on the branch with:

SELECT '5' SIMILAR TO '\d', 'd' SIMILAR TO '\d';

This returns true, false, but I would expect false, true.

Could we escape the next character as a regex literal instead, for example by applying regex escaping to that single character, rather than always prefixing it with \?

| ScalarValue::Null,
m,
) => Expr::Literal(ScalarValue::Utf8(None), m.clone()),
_ => {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it only fixes literal patterns, while non-literal SIMILAR TO patterns are now rejected during physical planning.

That seems like a SQL-visible regression, since SIMILAR TO should accept pattern expressions. For example, this should still be evaluable rather than failing during planning:

SELECT s FROM test WHERE s SIMILAR TO s;

Could we move the translation to a layer that can handle runtime pattern values, such as a physical expression or scalar kernel that translates the pattern column before regex matching? Alternatively, please preserve the currently supported non-literal cases while still enforcing SQL SIMILAR TO semantics end to end.

@Dandandan Dandandan marked this pull request as draft May 23, 2026 12:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-expr Changes to the physical-expr crates sql SQL Planner sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PostgreSQL compatibility: SIMILAR TO should treat % as a wildcard

2 participants